Datasets were searched for using the same filters as before - must be human tissue that expressed lewy bodies, must have at least 3 patients and 3 controls, and must…
10 datasets were identified:
| CodeID | Tissue/Cell type | Variant | Samples | Type of array | GEO Ref | Comments | PIs |
|---|---|---|---|---|---|---|---|
| DIJ | Substantia Nigra | not specified | 15 disease, 8 control | Affymetrix Human Genome U133 Plus 2.0 Array | GSE49036 | Has some Braak staging | Dijkstra AA, Ingrassia AI, de Menezes RX, van Kesteren RE, Rozemuller AM, Heutink P, vandeBerg WJ |
| FFR | Substantia Nigra | not specified | 9 replicates for the controls and 16 replicates for the Parkinson’s disease patients | Affymetrix Human Genome U133 Plus 2.0 Array | GSE7621 | Ffrench-Mullen JM | |
| MOR.SN | substantia nigra, split into medial and lateral portions | sporadic | 15 samples of medial parkinsonian SN, 9 samples of lateral parkinsonian SN (24), 8 medial nigra control samples and 7 lateral nigra control samples(15) | Affymetrix Human Genome U133A Array | GSE8397 | Age and sex | Moran LB, Graeber MB |
| MID1 | Substantia Nigra pars compacta | not specified | 10 PD brain samples and 8 control | Affymetrix Human Genome U133 Plus 2.0 Array | GSE20141 | Middleton FA, Kim PD, Zhang-James Y, Davis RL | |
| MID2 | whole substantia nigra | not specified | 18 controls, 11 patients | Affymetrix Human Genome U133A Array | GSE20292 | Age and sex | Middleton FA, James M, Zhang Y, Davis RL |
| DUM | Cortex (BA9) | not specified | 29 PD, 44 neurologically normal controls | Illumina HiSeq 2000 | GSE68719 | Have age of death, sex, and some samples have corresponding proteomics | Dumitriu A, Golji J, Labadorf AT, Gao B, Beach TG, Myers RH, Longo KA, Latourelle JC |
| MOR.FC | Frontal cortex | sporadic | 3 Controls 5 Patients | Affymetrix Human Genome U133A Array | GSE8397 | Age and sex | Moran LB, Graeber MB |
| LEW | two regions of the medulla | not specified | 14 PD samples and 8 controls (two regions) | Affymetrix Human Genome U133A 2.0 Array | GSE19587 | Age and sex | lewandowski nm, small sa |
| MID3 | prefrontal cortex | not specified | 15 controls 14 patients | Affymetrix Human Genome U133A Array | GSE20168 | Age and sex | Middleton FA, James M, Zhang Y, Davis RL |
| MID4 | putamen | not specified | 20 controls and 15 patients | Affymetrix Human Genome U133A Array | GSE20291 | Age and sex | Middleton FA, James M, Zhang Y, Davis RL |
Checked data distribution using PCA and boxplots. Some of the datasets have quite mixed samples wereas others separate quite nicely into controls and patients. Boxplots didn’t suggest any samples were outliers.
As with the TDP-43 data, microarray data was analysed using Limma. The genes were divided into positive and negative fold change. Intersected gene lists can be found in /Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression/FoldChange.
| ~~~ | ~~~ | ~~~ | ~~~ | ~~~ | ~~~ | ~~~ | ~~~ | ~~~ | ~~~ | ~~~ |
|---|---|---|---|---|---|---|---|---|---|---|
| A4GNT | C3 | CPED1 | FAM46A | HCK | KDM4A | LY6G5C | NRDE2 | RBMS2 | SMIM14 | TNFRSF10B |
| ABCC6 | C3AR1 | CPM | FAM60A | HDC | KDM4C | LYVE1 | NYX | RCC1 | SNAP23 | TNFRSF10D |
| AC012065.7 | C8orf60 | CPSF3L | FANCE | HERC2P3 | KIAA0101 | LYZL6 | OR1F1 | RDH16 | SND1-IT1 | TNFRSF25 |
| ACP5 | C9orf3 | CROCC | FAS | HIST1H1T | KIAA0226L | MAF | ORM1 | RHBDF2 | SOGA1 | TNFSF14 |
| ACSBG2 | CALML4 | CROCCP3 | FASLG | HIST1H2AB | KIAA0485 | MAFF | OSM | RIBC2 | SORBS1 | TNN |
| ACSS3 | CASP1 | CSPG4 | FCGBP | HIST1H2AD | KIAA1614 | MAP2K7 | P4HA1 | RIPK2 | SOS2 | TP73-AS1 |
| ADH1B | CASP2 | CTBP2 | FCGR2A | HMOX1 | KIAA1661 | MAPKBP1 | PAGE1 | RNASE6 | SOX12 | TRAF1 |
| ADIPOQ | CASP4 | CTD-2269F5.1 | FCGR2B | HNRNPM | KIF13A | MAVS | PAPOLA | RP11-217B7.2 | SOX2 | TRAPPC10 |
| ADORA3 | CASP6 | CXCR4 | FCRL2 | HOXB8 | KLHL36 | MBD5 | PARP16 | RP11-255C15.3 | SOX9 | TRIM21 |
| AF007147 | CASP7 | CXorf21 | FGF17 | HP1BP3 | KLK2 | MBTD1 | PAWR | RP11-403P17.4 | SP140L | TRIM5 |
| AGAP10 | CASP8 | CXorf36 | FGF3 | HS3ST1 | KRI1 | MCM3AP-AS1 | PCDH12 | RP11-548H18.2 | SPG21 | TTLL5 |
| AGRP | CATSPER2 | CYP21A1P | FGFR1 | HSPA1A | KRT19P2 | MECOM | PCDHGA8 | RPA4 | SPINK1 | TUBD1 |
| ALOX5 | CCDC101 | CYP39A1 | FKBP10 | HSPA1L | L2HGDH | MED13L | PDLIM1 | RPL23AP32 | SPTLC3 | TULP2 |
| ALOX5AP | CCDC15 | CYTH4 | FLCN | HSPA6 | LAG3 | METTL4 | PELI2 | RPL23AP53 | SRP19 | TYMP |
| ALPK1 | CCDC40 | DAPP1 | FLJ21369 | HSPD1 | LAMB2 | METTL7A | PGF | RREB1 | SRSF5 | TYROBP |
| AMELX | CCL17 | DBF4B | FLT1 | IFNA2 | LAT2 | MFAP5 | PGPEP1 | RRNAD1 | SSTR3 | UGGT1 |
| AMELY | CCL22 | DCLRE1C | FLT3LG | IFNA21 | LECT1 | MICB | PIDD1 | RUNX1-IT1 | ST20 | UPK3B |
| AMH | CCL27 | DDR1-AS1 | FMO5 | IGF1R | LEF1 | MID1 | PIGO | RUNX3 | ST6GAL1 | USP34 |
| ANGPT2 | CD14 | DENND2D | FNDC3B | IGFBP5 | LENEP | MILR1 | PILRA | S100A11P1 | STAG3L3 | VPS54 |
| ANP32A-IT1 | CD163 | DFFB | FOXD1 | IGKV1-17 | LEPREL1 | MIR612 | PIM1 | S100A4 | STIP1 | VSIG4 |
| AP1G2 | CD207 | DIP2A | FOXL2 | IGLL3P | LEPREL4 | MKRN4P | PLAC8 | SAFB2 | STK10 | WBSCR16 |
| APOBEC3C | CD22 | DISC1 | FPR1 | IGSF9B | LGALS9 | MS4A6A | PLEKHF2 | SAT1 | STOM | WDR4 |
| APOC2 | CD247 | DKFZP434A062 | FYB | IL12RB1 | LILRA2 | MSH5 | PLGLB1 | SCAF4 | TAF4 | WDR52 |
| AQP8 | CD300A | DMC1 | FZD5 | IL16 | LINC00115 | MSX1 | PLIN2 | SCD5 | TAL1 | WDR55 |
| ARHGAP25 | CD37 | DNA2 | FZD7 | IL17RB | LINC00260 | MT1M | PLK3 | SCIN | TAS2R10 | WDR78 |
| ARHGDIB | CD48 | DNAJB1 | FZD9 | IL18 | LINC00894 | MYH7B | PLOD2 | SCNN1B | TAS2R13 | XRCC2 |
| ARHGEF4 | CD68 | DNAJB6 | G0S2 | IL1R1 | LINC00963 | MYO1C | PLTP | SDCCAG3 | TAZ | ZBED2 |
| ARID3A | CD84 | DNASE2 | G3BP1 | IL21 | LLGL2 | MZT2B | PODNL1 | SERPINA1 | TBX6 | ZC3HAV1 |
| ATAD2B | CDA | DOCK10 | GABPA | INE1 | LOC100272216 | NABP1 | POGLUT1 | SERPINH1 | TBXAS1 | ZFP36L1 |
| ATHL1 | CDC14A | DOCK2 | GAL3ST4 | INPP5B | LOC100506699 | NACAP1 | POGZ | SERTAD3 | TCF3 | ZMYND8 |
| ATP8B1 | CDK2AP2 | DSE | GAS1 | INPP5D | LOC101928198 | NBEAL2 | PPP1R14D | SH2B2 | TFAP2B | ZNF14 |
| ATXN7 | CEBPD | DTYMK | GATA1 | INVS | LOC101928274 | NBR2 | PPP2R1B | SHMT1 | TFEC | ZNF217 |
| AVIL | CEL | EEF1DP5 | GCKR | IRF4 | LOC101929240 | NCKAP1L | PRR11 | SIGLEC7 | TFPI | ZNF224 |
| AZGP1 | CHD2 | EFNA4 | GDF15 | IRF6 | LOC101929889 | NEUROG1 | PRRG4 | SIGLEC8 | TGFBR3 | ZNF235 |
| AZGP1P1 | CHKB-CPT1B | EGF | GDF9 | IRF7 | LOC102724229 | NFATC3 | PSCA | SIM2 | TGIF1 | ZNF34 |
| AZU1 | CHRNG | ELF4 | GEM | ITFG2 | LOC102725016 | NFE2 | PSG11 | SIX6 | TIMM44 | ZNF354A |
| BANP | CHST4 | EMC10 | GJA9-MYCBP | ITGB6 | LOC284009 | NHLH1 | PTPN14 | SLA | TLR5 | ZNF430 |
| BC069804 | CLEC2B | ENTPD1 | GJD2 | ITPR3 | LOC729164 | NOTCH2NL | PTPRCAP | SLAMF8 | TM4SF1 | ZNF446 |
| BMP8B | CLIC2 | ENTPD7 | GK2 | JMJD6 | LOC79999 | NOX3 | RAB13 | SLC11A1 | TMEM176A | ZNF516 |
| BRPF1 | CLK1 | EP300 | GP9 | KCNE4 | LPAR2 | NPIPA1 | RAC2 | SLC27A3 | TMEM19 | ZNF665 |
| BTBD18 | COL16A1 | EPOR | GPR25 | KCNJ8 | LPAR4 | NPIPB15 | RAD52 | SLC2A5 | TMEM51 | ZNF670 |
| C10orf12 | COL5A1 | ETV1 | GRAMD3 | KCNK5 | LPP | NPL | RAD54L | SLC35C2 | TMPRSS11E | ZNF710 |
| C11orf16 | COL7A1 | FAM106A | GTSE1 | KCNQ1 | LRRC1 | NPR3 | RBL1 | SLC7A9 | TMPRSS5 | ZNF721 |
| C1QB | COL9A1 | FAM118A | GUSBP3 | KCTD5 | LRRFIP1 | NR6A1 | RBM38 | SMAD6 | TNFAIP8 | ZNF74 |
| ZNF783 |
| ~~~ | ~~~ | ~~~ | ~~~ | ~~~ | ~~~ | ~~~ | ~~~ | ~~~ | ~~~ | ~~~ |
|---|---|---|---|---|---|---|---|---|---|---|
| ABCA11P | ATP5D | CLINT1 | EIF2B3 | GPN1 | LOC728392 | MZT2A | PFDN2 | PSMD8 | SLC9A1 | TSSC1 |
| ABCF1 | ATP5F1 | CLIP3 | EIF2S1 | GPX4 | LRBA | NAPA | PFKM | PSME3 | SLITRK5 | TTC19 |
| ABHD11 | ATP5G1 | CLTB | EIF3F | GRAMD1B | LRPPRC | NDEL1 | PFN2 | PSMG1 | SNCA | TTC9 |
| ACAT1 | ATP6AP1 | CMAS | EIF3I | GSS | LYRM4 | NDFIP1 | PGAM1 | PTBP2 | SNRPN | TUBA1B |
| ACAT2 | ATP6V0B | COA3 | EIF3K | GSTA4 | LYRM9 | NDUFA10 | PGRMC1 | PTDSS1 | SNX24 | TUBA1C |
| ACP2 | ATP6V0C | COA7 | EIF6 | GSTO1 | MAGED1 | NDUFA13 | PHB2 | PTP4A1 | SORD | TUBB |
| ACTB | ATP6V1E1 | COMMD9 | ELOVL6 | HARS | MAGED2 | NDUFA3 | PINK1 | PTPRA | SPAG16 | TUBB2A |
| ACTR1A | ATP6V1F | COPS3 | ELP3 | HERC6 | MAK16 | NDUFA7 | PITPNA | RAB14 | SPAG7 | TUBB3 |
| ADAM23 | ATP6V1G2 | COPS4 | EMC7 | HINT1 | MAP1B | NDUFA9 | PITPNB | RABEPK | SPCS1 | TUBG1 |
| AFG3L2 | ATP6V1H | COPS5 | ENDOD1 | HLTF | MAP1LC3B | NDUFAB1 | PITRM1 | RAN | SPCS3 | TUFM |
| AGTPBP1 | AURKAIP1 | COPS7A | ENOPH1 | HMCES | MAPK10 | NDUFAF1 | PKI55 | RANGRF | SPRYD7 | TUSC2 |
| AHCY | AVPI1 | COQ3 | ENSA | HMGN4 | MDH1 | NDUFB5 | PLD3 | RARS | SRPRB | TXNDC15 |
| AIFM1 | B3GNT2 | COX7A2L | ENTPD6 | HNRNPA0 | MDH2 | NDUFS1 | PLEKHB2 | RBFOX2 | SRRD | UBB |
| AIMP2 | BABAM1 | COX8A | ERCC1 | HPCAL4 | ME2 | NDUFS3 | PMPCA | RBM3 | SSSCA1 | UBE2E3 |
| AJAP1 | BECN1 | CREB3 | ERP29 | HSPA12A | ME3 | NDUFS5 | PMS2 | RITA1 | ST3GAL5 | UBE3B |
| AK055981 | BEND5 | CRMP1 | EXOSC9 | IARS | MECR | NDUFV1 | PMS2P1 | RNASEH1 | STAT4 | UCHL1 |
| AK5 | BEX4 | CRY2 | FABP3 | IARS2 | MFN2 | NDUFV2 | PMS2P5 | RNF10 | STRAP | UQCRC1 |
| AKAP12 | BFSP1 | CSNK2A1 | FAF1 | IDH3B | MGST3 | NEDD8 | POLR2C | RPA2 | STS | UQCRFS1 |
| AKAP6 | BPGM | CSRNP2 | FAM127A | IDH3G | MICU1 | NEU1 | POLR3C | RPH3A | STX12 | UQCRQ |
| AKTIP | BRE | CTNNA2 | FAM134C | IDI1 | MIEF1 | NGFRAP1 | POP4 | RPP14 | STXBP1 | UROD |
| ALAS1 | BRF2 | CUL2 | FAM162A | IMP4 | MIF | NHP2 | PPFIA2 | RPS6KC1 | SUMO3 | UROS |
| ALDH1A1 | BSN | CX3CL1 | FAM168A | INSIG1 | MIR21 | NIF3L1 | PPIA | RRAGA | SYN1 | USP7 |
| ANXA2 | BTBD3 | CXorf40A | FAM206A | IQSEC1 | MIR3656 | NIPSNAP1 | PPME1 | RTCB | SYT11 | UTP18 |
| ANXA6 | C11orf49 | CXorf40A | FBXW2 | IRGQ | MIR4784 | NIT2 | PPP1R7 | RTF1 | TACO1 | VAMP7 |
| AP1S1 | C12orf10 | CXorf40B | FDPS | ITFG1 | MIR636 | NME1 | PPP2CA | RTN4 | TCEA2 | VLDLR |
| AP2M1 | C1orf216 | CYC1 | FECH | ITPA | MIR6890 | NOP16 | PPP2R1A | RWDD2B | TCEAL2 | VPS41 |
| AP2S1 | C21orf33 | CYP2E1 | FHIT | ITPR1 | MLH1 | NRXN3 | PPP2R2B | SAMM50 | TCEAL4 | VPS4B |
| AP3M2 | C3orf62 | DCTN3 | FIBP | KATNB1 | MMP24-AS1 | NSDHL | PPP2R5D | SARS | TFCP2 | VPS51 |
| AP3S2 | C5orf30 | DCTN6 | FIG4 | KHDRBS1 | MOCS2 | NUDCD3 | PPP3CB | SCAMP5 | TFPT | WBP11 |
| APEH | C6orf106 | DCTPP1 | FKBP1B | KIAA0391 | MOSPD1 | NUDT2 | PRC1 | SCFD1 | TIMM10 | WDR44 |
| APEX1 | CCBL2 | DDA1 | FLRT1 | KIAA0513 | MPI | NUP155 | PRKCZ | SDHA | TIMM10B | WDR61 |
| APLP2 | CCDC51 | DDHD2 | FRMPD4 | KIAA1279 | MRPL15 | ODC1 | PRMT8 | SDHAF1 | TIMM8B | WDR7 |
| APMAP | CCK | DDX1 | FTO | KIF2A | MRPL35 | ODF2 | PRPF4 | SDHAP1 | TMED3 | WDR77 |
| AREL1 | CCNH | DDX24 | FUCA1 | KIF3C | MRPL4 | OPTN | PRR13 | SEC31A | TMEM177 | WDYHV1 |
| ARF1 | CCSER2 | DDX42 | GABARAPL1 | KIFAP3 | MRPS18A | ORC5 | PSD3 | SEH1L | TMEM208 | YWHAZ |
| ARF3 | CDC123 | DGUOK | GABBR1 | KLHDC4 | MRPS22 | OSBP | PSMA1 | SERINC3 | TMEM246 | ZDHHC4 |
| ARFGAP2 | CDC42 | DHRS11 | GALNT11 | L1CAM | MRPS33 | OXCT1 | PSMA5 | SEZ6L2 | TMEM41B | ZDHHC6 |
| ARHGEF9 | CDK20 | DHRS7B | GARS | LANCL1 | MRPS35 | OXLD1 | PSMB3 | SF3A3 | TMEM97 | ZMYM4 |
| ARL2BP | CDK5 | DIABLO | GBAS | LANCL2 | MRPS7 | PAAF1 | PSMB4 | SF3B5 | TMUB2 | ZNF365 |
| ARMC2-AS1 | CDK7 | DKK3 | GGNBP2 | LBH | MRTO4 | PAFAH1B1 | PSMB5 | SH3BP5 | TOMM20 | ZNF593 |
| ARPC1A | CDS2 | DLD | GLO1 | LCMT1 | MTCH1 | PARL | PSMB6 | SIPA1L1 | TOR1AIP2 | ZNF629 |
| ARPC5L | CERK | DNAJC8 | GLOD4 | LDHA | MTCH2 | PARN | PSMB7 | SLC23A2 | TOX4 | |
| ASS1 | CFL1 | DRG1 | GLT8D1 | LDHB | MTCL1 | PCCB | PSMC1 | SLC25A11 | TP53BP1 | |
| ASTN2 | CHCHD2 | DSTN | GMPR2 | LDLRAD4 | MTHFD1 | PCYOX1L | PSMC2 | SLC25A3 | TPGS2 | |
| ATG14 | CHP1 | DUSP26 | GNB1 | LDOC1 | MTPAP | PDCL3 | PSMC5 | SLC25A46 | TRAP1 | |
| ATP1A1 | CHST1 | DYNC1H1 | GNB5 | LETMD1 | MTX2 | PDHB | PSMD1 | SLC25A5 | TRAPPC2L | |
| ATP1A3 | CIAPIN1 | DYNC1I1 | GOT1 | LINC00094 | MX1 | PDK2 | PSMD11 | SLC25A6 | TSG101 | |
| ATP2B2 | CIRBP | DZIP3 | GOT2 | LMO3 | MYH10 | PDZD8 | PSMD14 | SLC30A9 | TSPAN3 | |
| ATP5A1 | CITED1 | EAPP | GPD1L | LOC101927673 | MYL12B | PEX14 | PSMD2 | SLC35B1 | TSPAN7 | |
| ATP5B | CLCN6 | EIF2AK1 | GPI | LOC645166 | MYL5 | PEX19 | PSMD7 | SLC6A1 | TSR2 |
Mouse data was sourced from GEO https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE4758, referenced as CON, however during the analysis I realised that alpha synuclein had a positive log fold change, even though every other dataset reported a negative log fold change for sporadic parkinson’s disease. Looking back at the dataset I realised it was an overexpression model rather than a mutant. Woops!
What’s interesting is that canonically, alpha synuclein has been suggested to be overexpressed in sporadic parkinson’s patients.
As was done with the TDP-43 pipeline, I had to find other signals that could be contaminating the sporadic signal. I decided to use the ALS datasets from the TDP-43 analysis to represent “general neurodegeneration”, (C9orf72, sALS, PET, RAV, SOD1, FUS) but actually there was very little overlap:
ALOX5 ALPK1 STOM MAFF COL7A1
PMS2P1 ATP6V1G2 INSIG1 RRAGA TPGS2 MTCH2 PPP2CA GLO1 PDHB
In a similar vein I decided to source some sporadic Parkinson’s disease blood - to make sure it was leaving a more tissue specific signal. I found two datasets on GEO - GSE99039 (AMA) and GSE72267 (RON). These were bigger datasets than I usually use - AMA was total 353 samples, RON 59. What I realised was that my filters for the presence calls were still 2. This meant that 24,000 genes/probes passed through filtering which was making huge overlaps. I realised a filter of 2 wasn’t really proportional to the number of samples in AMA, so I increased this to 20 (as the dataset is approximately 10 times larger than the other datasets I have been using).
These two datasets produced a common gene expression signature of approximately 5000 genes, which is quite a lot of concordance. When you compare this to the brain signal, there is an overlap of 88 upregulated and 76 downregulated genes. This is quite a small overlap considering the size of the blood signature.
At this point, the remaining gene list for the sporadic signature is 558 genes (look for Filt_ALS_blood files)
Down http://amp.pharm.mssm.edu/Enrichr/enrich?dataset=3rwmj
Up http://amp.pharm.mssm.edu/Enrichr/enrich?dataset=3rwmr
There was a dataset I labelled “CON” with the GSEnumber GSE4758 that I included for a little while as it was a mouse model of alpha synuclein which I later removed as the research question changed.
Oliver Bandmann mentioned that there is a phenomenon that LRRK2 patients exhibit their disease more closely to sporadic patients than other familial patients. I thought this was interesting and could be a good application of my methdology.
I sourced two datasets with LRRK2 patient data - GSE34516 and GSE23290. These were exon array so had to first be preprocessed by Wenbin in the Affymetrix Console using RMA-sketch. The output was normalised expression data. Mas5Calls didn’t seem to work so I skipped this step.
#### BOT GSE34516 #########
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Data/")
Data <- read.csv("GSE34516.RMA-GENE-EXTENDED-EC-hg19-na36_LRRK2only.csv", row.names = 1)
genename <- as.data.frame(Data$GeneSymbol)
rownames(genename) <- rownames(Data)
colnames(genename) <- "Gene.Symbol"
#Data is already normalised using RMA-sketch
analysis.name<-"BOT" #Label analysis
expressionMatrix<-Data[,1:6] #takes expression from normalised expression set
Treat<-factor(rep(c("Control", "Patient"),c(4,2)), levels=c("Control", "Patient"))
design<-model.matrix(~Treat)
rownames(design)<-colnames(expressionMatrix)
design
#Conduct statistical analysis of expression
library(limma)
fit<-lmFit(expressionMatrix, design) #linear model fit
fit<-eBayes(fit)
result<-topTable(fit, coef="TreatPatient", adjust="BH", number=nrow(expressionMatrix)) #"BH" adjust for multiple hypothesis testing
#toptable normally takes top number but this takes all
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression")
write.csv(result, file = paste(analysis.name, "_result.csv", sep = ""))
result$"Fold Change"<-2^result$logFC
result$"Fold Change"[result$"Fold Change"<1]<-(-1)/result$"Fold Change"[result$"Fold Change"<1] #converts log fold change into a linear value above or below 0
expressionLinear<-as.data.frame(2^expressionMatrix)
expressionLinear$ProbeSetID<-rownames(expressionLinear)
result<-merge(result, expressionLinear, by = 0) #merge values into one array
result<-merge(result, genename, by.x = "Row.names", by.y = 0)
result<-subset(result, subset=(Gene.Symbol !="---")) #if no gene symbol, discount
setwd(dir = "/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression/")
write.csv(result, file=paste(analysis.name, "_finalresult.csv", sep=""), row.names=FALSE, quote = FALSE)
genesort <- result[order(result$P.Value),]
uniqueresult <- genesort[!duplicated(genesort[,16]),]
write.csv(uniqueresult, file=paste(analysis.name, "rankeduniqueresult.csv", sep=""), row.names=FALSE, quote = FALSE)
#### BOT2 GSE23290 #########
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Data/")
Data <- read.csv("GSE23290.RMA-GENE-EXTENDED-EC-hg19-na36_LRRK2only.csv", row.names = 1)
genename <- as.data.frame(Data$GeneSymbol)
rownames(genename) <- rownames(Data)
colnames(genename) <- "Gene.Symbol"
#Data is already normalised using RMA-sketch
analysis.name<-"BOT2" #Label analysis
expressionMatrix<-Data[,1:8] #takes expression from normalised expression set
Treat<-factor(rep(c("Control", "Patient"),c(5,3)), levels=c("Control", "Patient"))
design<-model.matrix(~Treat)
rownames(design)<-colnames(expressionMatrix)
design
#Conduct statistical analysis of expression
library(limma)
fit<-lmFit(expressionMatrix, design) #linear model fit
fit<-eBayes(fit)
result<-topTable(fit, coef="TreatPatient", adjust="BH", number=nrow(expressionMatrix)) #"BH" adjust for multiple hypothesis testing
#toptable normally takes top number but this takes all
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression")
write.csv(result, file = paste(analysis.name, "_result.csv", sep = ""))
result$"Fold Change"<-2^result$logFC
result$"Fold Change"[result$"Fold Change"<1]<-(-1)/result$"Fold Change"[result$"Fold Change"<1] #converts log fold change into a linear value above or below 0
expressionLinear<-as.data.frame(2^expressionMatrix)
expressionLinear$ProbeSetID<-rownames(expressionLinear)
result<-merge(result, expressionLinear, by = 0) #merge values into one array
result<-merge(result, genename, by.x = "Row.names", by.y = 0)
result<-subset(result, subset=(Gene.Symbol !="---")) #if no gene symbol, discount
setwd(dir = "/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression/")
write.csv(result, file=paste(analysis.name, "_finalresult.csv", sep=""), row.names=FALSE, quote = FALSE)
genesort <- result[order(result$P.Value),]
uniqueresult <- genesort[!duplicated(genesort[,18]),]
write.csv(uniqueresult, file=paste(analysis.name, "rankeduniqueresult.csv", sep=""), row.names=FALSE, quote = FALSE)
I had to skip out a number of steps because it didn’t require the annotation file. There were about 28000 rows left over which was quite a lot, though approximately 7000 were LOC transcripts so took us down to the number of genes you would expect.
After adding these two datasets to the FoldChange_PD.R script, and filtering out for ALS and sporadic PD blood, there were 85 upregulated and 109 downregulated genes. This still included MSX1, SNCA, UCHL1, and ALDH1A1 from PD Malacards but not PINK1 anymore.
EnrichR UP: http://amp.pharm.mssm.edu/Enrichr/enrich?dataset=3t9ff DOWN: http://amp.pharm.mssm.edu/Enrichr/enrich?dataset=3t9fh
The other datasets I had were fibroblasts from PARK2 and LRRK2 patients as provided by Robin Highly/Oliver Bandmann/Heather Mortiboys. These were exon array datasets which were analysed by Wenbin using the Affymetrix console. When these were added to the filters, the number of common genes was 36, only one of which MSX1 was retained.
To give a little more signal, but to also see if the signal could be recovered, I discluded the blood data (as fibroblast is representative enough of a non-neuronal tissue), leaving 54 genes
DOCK2 RAD52 TNFAIP8 P4HA1 MS4A6A CD300A KLHL36 TRIM5 CD163 CXCR4 RIPK2 PAWR ZNF516 JMJD6 CASP7 MSX1 IGF1R LRRC1 SAFB2 C1QB IL17RB LAMB2 CCDC40 CDA FMO5 DNASE2 NFE2 ZNF217 HMOX1 PARP16 HP1BP3 PGPEP1 CD14 NPIPB15 IMP4 LETMD1 AP1S1 FHIT MECR ATP2B2 DCTPP1 WDYHV1 PSMC2 NRXN3 MRTO4 NDUFS3 SLC6A1 NME1 MRPL15 COPS3 TUBG1 PRC1 BFSP1 IARS
Approximately 10 of these genes have a relatively strong relationship with Parkinson’s or LRRK2 specifically. 32 were directly related according to VarElect, 22 indirectly related (VarElect_Results_cmgreen1-sheffield-ac-uk-20180522-014621619).
The next step was to find the PPI partners of these 54 genes.
#### ALS PARK2 LRRK2 ####
library(biomaRt)
setwd("/Users/clairegreen/Documents/PhD/TDP-43/TDP-43_Code/Results/PPI_Network/")
PPI <- read.table("iref14_Human_UP_noDup_table_nodash.txt", header = T)
braingenes <- read.csv("Zhang_BrainCelltype_Markers_braingenes.csv", header = T)
DEG_list <- readLines("~/Documents/PhD/Parkinsons/Parkinsons_Code/Results/ALSPARK2LRRK2/removeallALLgenes.txt")
mart <- useMart("ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl", host="www.ensembl.org")
attributes <- listAttributes(mart)
mart_back <- getBM(attributes =c("hgnc_symbol", "uniprotswissprot"), filters="hgnc_symbol", values=DEG_list, mart=mart)
genelist_Uniprot <- subset(mart_back, !(mart_back$uniprotswissprot == ""))
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/ALSPARK2LRRK2/")
write.csv(genelist_Uniprot, "martback.csv", row.names = F)
# IDENTIFY MISSING GENES AND FIND UNIPROT CODES FOR THEM. NB SOME GENES MAY NOT BE PROTEIN CODING #
mart_table <- read.csv("martback.csv", header = T) #A table with the uniprot codes for the DEGs
uniprot_gene <- mart_table$uniprotswissprot
DEG_PPI <- subset(PPI, PPI$V1 %in% uniprot_gene | PPI$V2 %in% uniprot_gene)
rownames(DEG_PPI) <- 1:nrow(DEG_PPI)
write.csv(DEG_PPI, "DEG_PPI_ALP.csv", row.names = F)
DEG_PPI <- read.csv("DEG_PPI_ALP.csv")
DEG_PPI <- subset(DEG_PPI, DEG_PPI$Gene1 !="-")
DEG_PPI <- subset(DEG_PPI, DEG_PPI$Gene2 !="-")
write.csv(DEG_PPI, "FinalPDPPI.csv", row.names = F, quote = F)
From this we have a network of 1367 interactions between 1034 proteins (file = PD_DEGPPI.R, DEG_PPI_ALP.csv, ALP_PPIgenes.txt. Gene names used).
Next I had to find the corresonding rows in each of the RNA expression data files ready for the correlation anaylysis. The LRRK2 samples were not included in the coexpression analysis because with only 2 and 3 patients respectively, it would results in rho values of 1/-1 or 1, 0.5, 0, -0.5 or 1, which are difficult to threshold. I also removed the MOR.FC dataset as the distribution of Rho values was abnormally distributed
[]
#### ALP####
#Read in network nodes
DEG_PPI <- readLines("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/ALSPARK2LRRK2/ALP_PPIGenes.txt")
setwd("/users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression/")
#Extract PPI network genes from each dataset#####
LEW <- read.csv("LEWfilteredresult.csv")
rownames(LEW) <- LEW$Gene.Symbol
LEW <- LEW[,28:41]
LEW <- subset(LEW, rownames(LEW) %in% DEG_PPI)
MID3 <- read.csv("MID3filteredresult.csv")
rownames(MID3) <- MID3$Gene.Symbol
MID3 <- MID3[,36:49]
MID3 <- subset(MID3, rownames(MID3) %in% DEG_PPI)
MID4 <- read.csv("MID4filteredresult.csv")
rownames(MID4) <- MID4$Gene.Symbol
MID4 <- MID4[,41:55]
MID4 <- subset(MID4, rownames(MID4) %in% DEG_PPI)
MOR.FC <- read.csv("MOR.FCfilteredresult.csv")
rownames(MOR.FC) <- MOR.FC$Gene.Symbol
MOR.FC <- MOR.FC[,24:28]
MOR.FC <- subset(MOR.FC, rownames(MOR.FC) %in% DEG_PPI)
DIJ <- read.csv("DIJfilteredresult.csv")
rownames(DIJ) <- DIJ$Gene.Symbol
DIJ <- DIJ[,29:43]
DIJ <- subset(DIJ, rownames(DIJ) %in% DEG_PPI)
FFR <- read.csv("FFRfilteredresult.csv")
rownames(FFR) <- FFR$Gene.Symbol
FFR <- FFR[,29:44]
FFR <- subset(FFR, rownames(FFR) %in% DEG_PPI)
MID1 <- read.csv("MID1filteredresult.csv")
rownames(MID1) <- MID1$Gene.Symbol
MID1 <- MID1[,28:37]
MID1 <- subset(MID1, rownames(MID1) %in% DEG_PPI)
MID2 <- read.csv("MID2filteredresult.csv")
rownames(MID2) <- MID2$Gene.Symbol
MID2 <- MID2[,39:49]
MID2 <- subset(MID2, rownames(MID2) %in% DEG_PPI)
MOR.SN <- read.csv("MOR.SNfilteredresult.csv")
rownames(MOR.SN) <- MOR.SN$Gene.Symbol
MOR.SN <- MOR.SN[,36:59]
MOR.SN <- subset(MOR.SN, rownames(MOR.SN) %in% DEG_PPI)
DUM <- read.csv("DUM_UniqueGene_DESeq2.csv")
rownames(DUM) <- DUM$hgnc_symbol
DUM <- DUM[,53:81]
DUM <- subset(DUM, rownames(DUM) %in% DEG_PPI)
#Find the gene names that all datasets have in common
DEG_com <- Reduce(intersect, list(rownames(DIJ), rownames(FFR),
rownames(LEW),rownames(MID1),rownames(MID2),
rownames(MID3),rownames(MID4),rownames(MOR.FC),
rownames(MOR.SN), rownames(DUM)))
This script resulted in 854 common names. Correlation analysis was conducted on sharc (/shared/hidelab2/user/mdp15cmg/Parkinsons/PD_Cor)
When the correlation relationships were downloaded and analysed, only 59 edges reached a consistent correlation value of over 0.1 or below -0.1 regardless of sign. This increases to 208 when the RNA-seq dataset is removed.
Talking to Kat we hypothesised that the Parkin Fibroblasts were perhaps driving away the signal too strongly. Also the number of samples was so small that it’s not necessarily giving a representative signal.
To compensate, I decided to go back to the blood samples and use some other samples from the AMA GSE99039 dataset which have familial mutations. There were 5 ATP13A2 cases, 15 Parkin cases, and 12 PINK1 cases. Two of the Parkin cases in the metadata didn’t have corresponding files, so this became 13 Parkin cases.
##### AMA ATP13A2 #########
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Data/GSE99039_keepfiles/")
#run program to choose .CEL files from directory
celfiles <- fileBrowser(textToShow = "Choose CEL files", testFun = hasSuffix("[cC][eE][lL]"))
Data<-ReadAffy(filenames=celfiles) #read in files
rmaEset<-rma(Data) #normalise using RMA
analysis.name<-"AMA_ATP13A2" #Label analysis
dataMatrixAll<-exprs(rmaEset) #takes expression from normalised expression set
#mas5call generates presence/absence calls for each probeset
mas5call<-mas5calls(Data)
callMatrixAll<-exprs(mas5call)
colnames(callMatrixAll)<-sub(".CEL", ".mas5-Detection", colnames(callMatrixAll),fixed=TRUE)
colnames(callMatrixAll)<-sub(".cel", ".mas5-Detection", colnames(callMatrixAll),fixed=TRUE)
callMatrixAll<-as.data.frame(callMatrixAll)
callMatrixAll$ProbeSetID<-rownames(callMatrixAll)
countPf<-function(x){
sum(x=="P")
}
#count how many samples have presence calls
countPl<-apply(callMatrixAll, 1, countPf)
callMatrixAll$ProbeSetID<-rownames(callMatrixAll)
countPdf<-data.frame(ProbeSetID=names(countPl), countP=countPl)
#read annotation
# USING ANNOTATION FILE (if .csv, convert to .txt using excel)
annotation.file<-"/Users/clairegreen/Documents/PhD/TDP-43/TDP-43_Data/HG-U133_Plus_2.na35.annot.csv/HG-U133_Plus_2.na35_SHORT.annot.txt"
annotation<-read.table(annotation.file, header = TRUE, row.names=NULL, sep="\t", skip=0, stringsAsFactors=F, quote = "", comment.char="!", fill = TRUE, as.is = TRUE)
dim(annotation)
nrow(annotation)
annotation<-subset( annotation, subset=(Gene.Symbol !="---")) #if no gene symbol, discount
# Remove rows in which genes are noted to have negative strand matching probes
idxNegativeStrand<-grep("Negative Strand Matching Probes", annotation$Annotation.Notes)
if(length(idxNegativeStrand)>0)
{
annotation<-annotation[-idxNegativeStrand,]
}
expressionMatrix<-exprs(rmaEset)
colnames(expressionMatrix)
#this is for matched samples
Treat<-factor(rep(c("Control", "Patient"),c(183,5)), levels=c("Control", "Patient"))
design<-model.matrix(~Treat)
rownames(design)<-colnames(expressionMatrix)
design
#Conduct statistical analysis of expression
library(limma)
fit<-lmFit(expressionMatrix, design) #linear model fit
fit<-eBayes(fit)
result<-topTable(fit, coef="TreatPatient", adjust="BH", number=nrow(expressionMatrix)) #"BH" adjust for multiple hypothesis testing
#toptable normally takes top number but this takes all
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression")
write.csv(result, file = paste(analysis.name, "_result.csv", sep = ""))
result$"ProbeSetID"<-rownames(result) #make probeset IDs the row names
head(result$"ProbeSetID")
result$"Fold Change"<-2^result$logFC
result$"Fold Change"[result$"Fold Change"<1]<-(-1)/result$"Fold Change"[result$"Fold Change"<1] #converts log fold change into a linear value above or below 0
expressionLinear<-as.data.frame(2^expressionMatrix)
expressionLinear$ProbeSetID<-rownames(expressionLinear)
result<-merge(result, expressionLinear, by.x="ProbeSetID", by.y="ProbeSetID") #merge values into one array
result<-merge(annotation, result, by.x="Probe.Set.ID", by.y="ProbeSetID")
result<-merge(result, countPdf, by.x="Probe.Set.ID", by.y="ProbeSetID")
result$Gene.Symbol <- sapply(strsplit(result$Gene.Symbol,"///"), `[`, 1)
result$Ensembl <- sapply(strsplit(result$Ensembl,"///"), `[`, 1)
setwd(dir = "/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression/")
write.csv(result, file=paste(analysis.name, "_finalresult.csv", sep=""), row.names=FALSE, quote = FALSE)
genesort <- result[order(result$P.Value),]
uniqueresult <- genesort[!duplicated(genesort[,5]),]
write.csv(uniqueresult, file=paste(analysis.name, "rankeduniqueresult.csv", sep=""), row.names=FALSE, quote = FALSE)
uniqueresult<-subset(uniqueresult, Gene.Symbol!="") #removes any probes for which there are no gene symbols
uniqueresult<-subset(uniqueresult, subset=(countP>20)) #only takes results that have at least 2 samples with a presence call for a probe
write.csv(uniqueresult, file=paste(analysis.name, "filteredresult.csv", sep=""), row.names=FALSE, quote = FALSE)
##### AMA PINK1 ##########
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Data/GSE99039_keepfiles/")
#run program to choose .CEL files from directory
celfiles <- fileBrowser(textToShow = "Choose CEL files", testFun = hasSuffix("[cC][eE][lL]"))
Data<-ReadAffy(filenames=celfiles) #read in files
rmaEset<-rma(Data) #normalise using RMA
analysis.name<-"AMA_PINK1" #Label analysis
dataMatrixAll<-exprs(rmaEset) #takes expression from normalised expression set
#mas5call generates presence/absence calls for each probeset
mas5call<-mas5calls(Data)
callMatrixAll<-exprs(mas5call)
colnames(callMatrixAll)<-sub(".CEL", ".mas5-Detection", colnames(callMatrixAll),fixed=TRUE)
colnames(callMatrixAll)<-sub(".cel", ".mas5-Detection", colnames(callMatrixAll),fixed=TRUE)
callMatrixAll<-as.data.frame(callMatrixAll)
callMatrixAll$ProbeSetID<-rownames(callMatrixAll)
countPf<-function(x){
sum(x=="P")
}
#count how many samples have presence calls
countPl<-apply(callMatrixAll, 1, countPf)
callMatrixAll$ProbeSetID<-rownames(callMatrixAll)
countPdf<-data.frame(ProbeSetID=names(countPl), countP=countPl)
#read annotation
# USING ANNOTATION FILE (if .csv, convert to .txt using excel)
annotation.file<-"/Users/clairegreen/Documents/PhD/TDP-43/TDP-43_Data/HG-U133_Plus_2.na35.annot.csv/HG-U133_Plus_2.na35_SHORT.annot.txt"
annotation<-read.table(annotation.file, header = TRUE, row.names=NULL, sep="\t", skip=0, stringsAsFactors=F, quote = "", comment.char="!", fill = TRUE, as.is = TRUE)
dim(annotation)
nrow(annotation)
annotation<-subset( annotation, subset=(Gene.Symbol !="---")) #if no gene symbol, discount
# Remove rows in which genes are noted to have negative strand matching probes
idxNegativeStrand<-grep("Negative Strand Matching Probes", annotation$Annotation.Notes)
if(length(idxNegativeStrand)>0)
{
annotation<-annotation[-idxNegativeStrand,]
}
expressionMatrix<-exprs(rmaEset)
colnames(expressionMatrix)
#this is for matched samples
Treat<-factor(rep(c("Control", "Patient"),c(183,12)), levels=c("Control", "Patient"))
design<-model.matrix(~Treat)
rownames(design)<-colnames(expressionMatrix)
design
#Conduct statistical analysis of expression
library(limma)
fit<-lmFit(expressionMatrix, design) #linear model fit
fit<-eBayes(fit)
result<-topTable(fit, coef="TreatPatient", adjust="BH", number=nrow(expressionMatrix)) #"BH" adjust for multiple hypothesis testing
#toptable normally takes top number but this takes all
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression")
write.csv(result, file = paste(analysis.name, "_result.csv", sep = ""))
result$"ProbeSetID"<-rownames(result) #make probeset IDs the row names
head(result$"ProbeSetID")
result$"Fold Change"<-2^result$logFC
result$"Fold Change"[result$"Fold Change"<1]<-(-1)/result$"Fold Change"[result$"Fold Change"<1] #converts log fold change into a linear value above or below 0
expressionLinear<-as.data.frame(2^expressionMatrix)
expressionLinear$ProbeSetID<-rownames(expressionLinear)
result<-merge(result, expressionLinear, by.x="ProbeSetID", by.y="ProbeSetID") #merge values into one array
result<-merge(annotation, result, by.x="Probe.Set.ID", by.y="ProbeSetID")
result<-merge(result, countPdf, by.x="Probe.Set.ID", by.y="ProbeSetID")
result$Gene.Symbol <- sapply(strsplit(result$Gene.Symbol,"///"), `[`, 1)
result$Ensembl <- sapply(strsplit(result$Ensembl,"///"), `[`, 1)
setwd(dir = "/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression/")
write.csv(result, file=paste(analysis.name, "_finalresult.csv", sep=""), row.names=FALSE, quote = FALSE)
genesort <- result[order(result$P.Value),]
uniqueresult <- genesort[!duplicated(genesort[,5]),]
write.csv(uniqueresult, file=paste(analysis.name, "rankeduniqueresult.csv", sep=""), row.names=FALSE, quote = FALSE)
uniqueresult<-subset(uniqueresult, Gene.Symbol!="") #removes any probes for which there are no gene symbols
uniqueresult<-subset(uniqueresult, subset=(countP>20)) #only takes results that have at least 2 samples with a presence call for a probe
write.csv(uniqueresult, file=paste(analysis.name, "filteredresult.csv", sep=""), row.names=FALSE, quote = FALSE)
##### AMA PARKIN ##########
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Data/GSE99039_keepfiles/")
#run program to choose .CEL files from directory
celfiles <- fileBrowser(textToShow = "Choose CEL files", testFun = hasSuffix("[cC][eE][lL]"))
Data<-ReadAffy(filenames=celfiles) #read in files
rmaEset<-rma(Data) #normalise using RMA
analysis.name<-"AMA_PARKIN" #Label analysis
dataMatrixAll<-exprs(rmaEset) #takes expression from normalised expression set
#mas5call generates presence/absence calls for each probeset
mas5call<-mas5calls(Data)
callMatrixAll<-exprs(mas5call)
colnames(callMatrixAll)<-sub(".CEL", ".mas5-Detection", colnames(callMatrixAll),fixed=TRUE)
colnames(callMatrixAll)<-sub(".cel", ".mas5-Detection", colnames(callMatrixAll),fixed=TRUE)
callMatrixAll<-as.data.frame(callMatrixAll)
callMatrixAll$ProbeSetID<-rownames(callMatrixAll)
countPf<-function(x){
sum(x=="P")
}
#count how many samples have presence calls
countPl<-apply(callMatrixAll, 1, countPf)
callMatrixAll$ProbeSetID<-rownames(callMatrixAll)
countPdf<-data.frame(ProbeSetID=names(countPl), countP=countPl)
#read annotation
# USING ANNOTATION FILE (if .csv, convert to .txt using excel)
annotation.file<-"/Users/clairegreen/Documents/PhD/TDP-43/TDP-43_Data/HG-U133_Plus_2.na35.annot.csv/HG-U133_Plus_2.na35_SHORT.annot.txt"
annotation<-read.table(annotation.file, header = TRUE, row.names=NULL, sep="\t", skip=0, stringsAsFactors=F, quote = "", comment.char="!", fill = TRUE, as.is = TRUE)
dim(annotation)
nrow(annotation)
annotation<-subset( annotation, subset=(Gene.Symbol !="---")) #if no gene symbol, discount
# Remove rows in which genes are noted to have negative strand matching probes
idxNegativeStrand<-grep("Negative Strand Matching Probes", annotation$Annotation.Notes)
if(length(idxNegativeStrand)>0)
{
annotation<-annotation[-idxNegativeStrand,]
}
expressionMatrix<-exprs(rmaEset)
colnames(expressionMatrix)
#this is for matched samples
Treat<-factor(rep(c("Control", "Patient"),c(183,13)), levels=c("Control", "Patient"))
design<-model.matrix(~Treat)
rownames(design)<-colnames(expressionMatrix)
design
#Conduct statistical analysis of expression
library(limma)
fit<-lmFit(expressionMatrix, design) #linear model fit
fit<-eBayes(fit)
result<-topTable(fit, coef="TreatPatient", adjust="BH", number=nrow(expressionMatrix)) #"BH" adjust for multiple hypothesis testing
#toptable normally takes top number but this takes all
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression")
write.csv(result, file = paste(analysis.name, "_result.csv", sep = ""))
result$"ProbeSetID"<-rownames(result) #make probeset IDs the row names
head(result$"ProbeSetID")
result$"Fold Change"<-2^result$logFC
result$"Fold Change"[result$"Fold Change"<1]<-(-1)/result$"Fold Change"[result$"Fold Change"<1] #converts log fold change into a linear value above or below 0
expressionLinear<-as.data.frame(2^expressionMatrix)
expressionLinear$ProbeSetID<-rownames(expressionLinear)
result<-merge(result, expressionLinear, by.x="ProbeSetID", by.y="ProbeSetID") #merge values into one array
result<-merge(annotation, result, by.x="Probe.Set.ID", by.y="ProbeSetID")
result<-merge(result, countPdf, by.x="Probe.Set.ID", by.y="ProbeSetID")
result$Gene.Symbol <- sapply(strsplit(result$Gene.Symbol,"///"), `[`, 1)
result$Ensembl <- sapply(strsplit(result$Ensembl,"///"), `[`, 1)
setwd(dir = "/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression/")
write.csv(result, file=paste(analysis.name, "_finalresult.csv", sep=""), row.names=FALSE, quote = FALSE)
genesort <- result[order(result$P.Value),]
uniqueresult <- genesort[!duplicated(genesort[,5]),]
write.csv(uniqueresult, file=paste(analysis.name, "rankeduniqueresult.csv", sep=""), row.names=FALSE, quote = FALSE)
uniqueresult<-subset(uniqueresult, Gene.Symbol!="") #removes any probes for which there are no gene symbols
uniqueresult<-subset(uniqueresult, subset=(countP>20)) #only takes results that have at least 2 samples with a presence call for a probe
write.csv(uniqueresult, file=paste(analysis.name, "filteredresult.csv", sep=""), row.names=FALSE, quote = FALSE)
#############################################################
################### FOLD CHANGE DEG #########################
#############################################################
setwd("/users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression/")
LEW <- read.csv("LEWfilteredresult.csv")
LEW <- LEW[order(LEW$P.Value),]
MID3 <- read.csv("MID3filteredresult.csv")
MID3 <- MID3[order(MID3$P.Value),]
MID4 <- read.csv("MID4filteredresult.csv")
MID4 <- MID4[order(MID4$P.Value),]
MOR.FC <- read.csv("MOR.FCfilteredresult.csv")
MOR.FC <- MOR.FC[order(MOR.FC$P.Value),]
DIJ <- read.csv("DIJfilteredresult.csv")
DIJ <- DIJ[order(DIJ$P.Value),]
FFR <- read.csv("FFRfilteredresult.csv")
FFR <- FFR[order(FFR$P.Value),]
MID1 <- read.csv("MID1filteredresult.csv")
MID1 <- MID1[order(MID1$P.Value),]
MID2 <- read.csv("MID2filteredresult.csv")
MID2 <- MID2[order(MID2$P.Value),]
MOR.SN <- read.csv("MOR.SNfilteredresult.csv")
MOR.SN <- MOR.SN[order(MOR.SN$P.Value),]
DUM <- read.csv("DUM_UniqueGene_DESeq2.csv")
DUM <- DUM[order(DUM$pvalue),]
BOT <- read.csv("BOTrankeduniqueresult.csv")
BOT <- BOT[order(BOT$P.Value),]
BOT2 <- read.csv("BOT2rankeduniqueresult.csv")
BOT2 <- BOT2[order(BOT2$P.Value),]
thresh <- 1
upLEW <- subset(LEW, LEW$Fold.Change >= thresh)
upLEWgene <- upLEW$Gene.Symbol
upMID3<- subset(MID3, MID3$Fold.Change >= thresh)
upMID3gene <- upMID3$Gene.Symbol
upMID4 <- subset(MID4, MID4$Fold.Change >= thresh)
upMID4gene <- upMID4$Gene.Symbol
upMOR.FC <- subset(MOR.FC, MOR.FC$Fold.Change >= thresh)
upMOR.FCgene <- upMOR.FC$Gene.Symbol
upDUM <- subset(DUM, DUM$log2FoldChange >= 0)
upDUMgene <- upDUM$hgnc_symbol
upDIJ <- subset(DIJ, DIJ$Fold.Change >= thresh)
upDIJgene <- upDIJ$Gene.Symbol
upFFR<- subset(FFR, FFR$Fold.Change >= thresh)
upFFRgene <- upFFR$Gene.Symbol
upMID1<- subset(MID1, MID1$Fold.Change >= thresh)
upMID1gene <- upMID1$Gene.Symbol
upMID2 <- subset(MID2, MID2$Fold.Change >= thresh)
upMID2gene <- upMID2$Gene.Symbol
upMOR.SN <- subset(MOR.SN, MOR.SN$Fold.Change >= thresh)
upMOR.SNgene <- upMOR.SN$Gene.Symbol
upBOT <- subset(BOT, BOT$Fold.Change >= thresh)
upBOTgene <- upBOT$Gene.Symbol
upBOT2 <- subset(BOT2, BOT2$Fold.Change >= thresh)
upBOT2gene <- upBOT2$Gene.Symbol
INTUP <- Reduce(intersect, list(upLEWgene, upMID3gene, upMID4gene, upMOR.FCgene,
upDIJgene, upFFRgene, upMID1gene, upMID2gene, upMOR.SNgene, upDUMgene, upBOTgene, upBOT2gene))
#### DOWN ####
thresh <- -1
downLEW <- subset(LEW, LEW$Fold.Change <= thresh)
downLEWgene <- downLEW$Gene.Symbol
downMID3 <- subset(MID3, MID3$Fold.Change <= thresh)
downMID3gene <- downMID3$Gene.Symbol
downMID4 <- subset(MID4, MID4$Fold.Change <= thresh)
downMID4gene <- downMID4$Gene.Symbol
downMOR.FC <- subset(MOR.FC, MOR.FC$Fold.Change <= thresh)
downMOR.FCgene <- downMOR.FC$Gene.Symbol
downDUM <- subset(DUM, DUM$log2FoldChange <= 0)
downDUMgene <- downDUM$hgnc_symbol
downDIJ <- subset(DIJ, DIJ$Fold.Change <= thresh)
downDIJgene <- downDIJ$Gene.Symbol
downFFR<- subset(FFR, FFR$Fold.Change <= thresh)
downFFRgene <- downFFR$Gene.Symbol
downMID1<- subset(MID1, MID1$Fold.Change <= thresh)
downMID1gene <- downMID1$Gene.Symbol
downMID2 <- subset(MID2, MID2$Fold.Change <= thresh)
downMID2gene <- downMID2$Gene.Symbol
downMOR.SN <- subset(MOR.SN, MOR.SN$Fold.Change <= thresh)
downMOR.SNgene <- downMOR.SN$Gene.Symbol
downBOT <- subset(BOT, BOT$Fold.Change <= thresh)
downBOTgene <- downBOT$Gene.Symbol
downBOT2 <- subset(BOT2, BOT2$Fold.Change <= thresh)
downBOT2gene <- downBOT2$Gene.Symbol
INTDOWN <- Reduce(intersect, list(downLEWgene, downMID3gene, downMID4gene, downMOR.FCgene,
downDIJgene, downFFRgene, downMID1gene, downMID2gene,
downMOR.SNgene, downDUMgene, downBOTgene, downBOT2gene))
########################### COMMON GENES ##############################
all <- c(INTUP, INTDOWN)
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression/FoldChange")
# write.table(INTUP,"Sp_LRRK2_upDEGs.txt", row.names = F, col.names = F, quote = F)
# write.table(INTDOWN,"Sp_LRRK2_downDEGs.txt", row.names = F, col.names = F, quote = F)
# write.table(all, "Sp_LRRK2_allDEGs.txt", row.names = F, col.names = F, quote = F)
PDgenes <- readLines("/Users/clairegreen/Documents/PhD/Parkinsons/ParkinsonsDiseaseMalacards.txt")
intersect(INTUP, PDgenes)
intersect(INTDOWN, PDgenes)
####### ALS signature ##########
setwd("/users/clairegreen/Documents/PhD/TDP-43/TDP-43_Code/Results/GeneExpression/noMedian/")
C9 <- read.csv("C9_unique.csv")
C9 <- C9[order(C9$P.Value),]
sals <- read.csv("sals_unique.csv")
sals <- sals[order(sals$P.Value),]
setwd("/users/clairegreen/Documents/PhD/TDP-43/TDP-43_Code/Results/GeneExpression/TDP-43_DEseq2/")
pet <- read.csv("PET_results_keepfiltering.csv")
rav <- read.csv("RAV_results_keepfiltering.csv")
setwd("/users/clairegreen/Documents/PhD/TDP-43/TDP-43_Code/Results/GeneExpression/non-TDP/")
FUS <- read.csv("FUSrankeduniqueresult.csv")
FUS <- FUS[order(FUS$P.Value),]
SOD1 <- read.csv("SOD1rankeduniqueresult.csv")
SOD1 <- SOD1[order(SOD1$P.Value),]
thresh <- 1
upC9 <- subset(C9, C9$Fold.Change >= thresh)
upC9gene <- upC9$Gene.Symbol
upSALS <- subset(sals, sals$Fold.Change >= thresh)
upSALSgene <- upSALS$Gene.Symbol
upPET <- subset(pet, pet$FoldChange >= thresh)
upPETgene <- upPET$hgnc_symbol
upRAV <- subset(rav, rav$FoldChange >= thresh)
upRAVgene <- upRAV$hgnc_symbol
upFUS <- subset(FUS, FUS$Fold.Change >= thresh)
upFUSgene <- upFUS$Gene.Symbol
upSOD1 <- subset(SOD1, SOD1$Fold.Change >= thresh)
upSOD1gene <- upSOD1$Gene.Symbol
INTUP_ALS <- Reduce(intersect, list(upC9gene, upSALSgene, upPETgene, upRAVgene, upFUSgene, upSOD1gene))
#### DOWN
thresh <- -1
downC9 <- subset(C9, C9$Fold.Change <= thresh)
downC9gene <- downC9$Gene.Symbol
downSALS <- subset(sals, sals$Fold.Change <= thresh)
downSALSgene <- downSALS$Gene.Symbol
downPET <- subset(pet, pet$FoldChange <= thresh)
downPETgene <- downPET$hgnc_symbol
downRAV <- subset(rav, rav$FoldChange <= thresh)
downRAVgene <- downRAV$hgnc_symbol
downFUS <- subset(FUS, FUS$Fold.Change <= thresh)
downFUSgene <- downFUS$Gene.Symbol
downSOD1 <- subset(SOD1, SOD1$Fold.Change <= thresh)
downSOD1gene <- downSOD1$Gene.Symbol
INTDOWN_ALS <- Reduce(intersect, list(downC9gene, downSALSgene, downPETgene, downRAVgene, downFUSgene, downSOD1gene))
##### COMMON GENES ###
upremove <- Reduce(intersect, list (INTUP, INTUP_ALS))
downremove <- Reduce(intersect, list(INTDOWN, INTDOWN_ALS))
###### REMOVE COMMON GENES ###
resultsup <- subset(INTUP, !(INTUP %in% upremove))
resultsdown <- subset(INTDOWN, !(INTDOWN %in% downremove))
results <- c(resultsup, resultsdown)
####### sPD Blood signature ##########
setwd("/users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression/")
AMA <- read.csv("AMAfilteredresult.csv")
AMA <- AMA[order(AMA$P.Value),]
RON <- read.csv("RONfilteredresult.csv")
RON <- RON[order(RON$P.Value),]
thresh <- 1
upAMA <- subset(AMA, AMA$Fold.Change >= thresh)
upAMAgene <- upAMA$Gene.Symbol
upRON <- subset(RON, RON$Fold.Change >= thresh)
upRONgene <- upRON$Gene.Symbol
INTUP_blood <- Reduce(intersect, list(upAMAgene, upRONgene))
#### DOWN ###
thresh <- -1
downAMA <- subset(AMA, AMA$Fold.Change <= thresh)
downAMAgene <- downAMA$Gene.Symbol
downRON <- subset(RON, RON$Fold.Change <= thresh)
downRONgene <- downRON$Gene.Symbol
INTDOWN_blood <- Reduce(intersect, list(downAMAgene, downRONgene))
##### COMMON GENES ###
upremove2 <- Reduce(intersect, list(INTUP, INTUP_blood))
downremove2 <- Reduce(intersect, list(INTDOWN, INTDOWN_blood))
##### REMOVE COMMON GENES ###
resultsup <- subset(resultsup, !(resultsup %in% upremove2))
resultsdown <- subset(resultsdown, !(resultsdown %in% downremove2))
results <- c(resultsup, resultsdown)
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression/FoldChange/")
# write.table(resultsup, "Sp_LRRK2_FiltALSPDblood_UPgenes.txt", quote = F, row.names = F, col.names = F)
# write.table(resultsdown, "Sp_LRRK2_FiltALSPDblood_DOWNgenes.txt", quote = F, row.names = F, col.names = F)
# write.table(results, "Sp_LRRK2_FiltALSPDblood_ALLgenes.txt", quote = F, row.names = F, col.names = F)
# cat(resultsup, sep="\n")
intersect(resultsup, PDgenes)
intersect(resultsdown, PDgenes)
####### fPD Blood signature ##########
setwd("/users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression/")
AMA_A <- read.csv("AMA_ATP13A2filteredresult.csv")
AMA_A <- AMA_A[order(AMA_A$P.Value),]
AMA_PARK <- read.csv("AMA_PARKINfilteredresult.csv")
AMA_PARK <- AMA_PARK[order(AMA_PARK$P.Value),]
AMA_PINK <- read.csv("AMA_PINK1filteredresult.csv")
AMA_PINK <- AMA_PINK[order(AMA_PINK$P.Value),]
thresh <- 1
upAMA_A <- subset(AMA_A, AMA_A$Fold.Change >= thresh)
upAMA_Agene <- AMA_A$Gene.Symbol
upAMA_PARK<- subset(AMA_PARK, AMA_PARK$Fold.Change >= thresh)
upAMA_PARKgene <- upAMA_PARK$Gene.Symbol
upAMA_PINK<- subset(AMA_PINK, AMA_PINK$Fold.Change >= thresh)
upAMA_PINKgene <- upAMA_PINK$Gene.Symbol
INTUP_fam_blood <- Reduce(intersect, list(upAMA_Agene, upAMA_PARKgene, upAMA_PINKgene))
#### DOWN ###
thresh <- -1
downAMA_A <- subset(AMA_A, AMA_A$Fold.Change <= thresh)
downAMA_Agene <- AMA_A$Gene.Symbol
downAMA_PARK<- subset(AMA_PARK, AMA_PARK$Fold.Change <= thresh)
downAMA_PARKgene <- downAMA_PARK$Gene.Symbol
downAMA_PINK<- subset(AMA_PINK, AMA_PINK$Fold.Change <= thresh)
downAMA_PINKgene <- downAMA_PINK$Gene.Symbol
INTDOWN_fam_blood <- Reduce(intersect, list(downAMA_Agene, downAMA_PARKgene, downAMA_PINKgene))
##### COMMON GENES ###
upremove3 <- Reduce(intersect, list(INTUP, INTUP_fam_blood))
downremove3 <- Reduce(intersect, list(INTDOWN, INTDOWN_fam_blood))
##### REMOVE COMMON GENES ###
resultsup <- subset(resultsup, !(resultsup %in% upremove3))
resultsdown <- subset(resultsdown, !(resultsdown %in% downremove3))
results <- c(resultsup, resultsdown)
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/FamilialBlood/")
write.table(resultsup, "ALS_sfblood_UPgenes.txt", quote = F, row.names = F, col.names = F)
write.table(resultsdown, "ALS_sfblood_DOWNgenes.txt", quote = F, row.names = F, col.names = F)
write.table(results, "ALS_sfblood_ALLgenes.txt", quote = F, row.names = F, col.names = F)
cat(resultsup, sep="\n")
intersect(resultsup, PDgenes)
intersect(resultsdown, PDgenes)
This produced 170 commonly differentially expressed genes (ALS_sfblood_ALLgenes.txt). These genes included Malacards SCNA, MSX1, UCHL1, ALDH1A1 .
The 170 gene seed created a network of 2159 proteins (/familialblood/PPIgenes). EnrichR (http://amp.pharm.mssm.edu/Enrichr/enrich?dataset=3ztlg) shows high enrichment for a number of processes, making it hard to know exactly what the geneset represents. However for GO Cellular component the highest enrichments are still for the mitochondria
In this case, I am somewhat confident that I am looking in the right area. The only way we will know if it is, however, is to look at whether there is common coexpression
#### Parkinson's DEG PPI Correlation ####
#Read in network nodes
#### Familial Blood ####
#Read in network nodes
DEG_PPI <- readLines("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/FamilialBlood/PPIGenes.txt")
setwd("/users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/GeneExpression/")
#Extract PPI network genes from each dataset#####
LEW <- read.csv("LEWfilteredresult.csv")
rownames(LEW) <- LEW$Gene.Symbol
LEW <- LEW[,28:41]
LEW <- subset(LEW, rownames(LEW) %in% DEG_PPI)
MID3 <- read.csv("MID3filteredresult.csv")
rownames(MID3) <- MID3$Gene.Symbol
MID3 <- MID3[,36:49]
MID3 <- subset(MID3, rownames(MID3) %in% DEG_PPI)
MID4 <- read.csv("MID4filteredresult.csv")
rownames(MID4) <- MID4$Gene.Symbol
MID4 <- MID4[,41:55]
MID4 <- subset(MID4, rownames(MID4) %in% DEG_PPI)
MOR.FC <- read.csv("MOR.FCfilteredresult.csv")
rownames(MOR.FC) <- MOR.FC$Gene.Symbol
MOR.FC <- MOR.FC[,24:28]
MOR.FC <- subset(MOR.FC, rownames(MOR.FC) %in% DEG_PPI)
DIJ <- read.csv("DIJfilteredresult.csv")
rownames(DIJ) <- DIJ$Gene.Symbol
DIJ <- DIJ[,29:43]
DIJ <- subset(DIJ, rownames(DIJ) %in% DEG_PPI)
FFR <- read.csv("FFRfilteredresult.csv")
rownames(FFR) <- FFR$Gene.Symbol
FFR <- FFR[,29:44]
FFR <- subset(FFR, rownames(FFR) %in% DEG_PPI)
MID1 <- read.csv("MID1filteredresult.csv")
rownames(MID1) <- MID1$Gene.Symbol
MID1 <- MID1[,28:37]
MID1 <- subset(MID1, rownames(MID1) %in% DEG_PPI)
MID2 <- read.csv("MID2filteredresult.csv")
rownames(MID2) <- MID2$Gene.Symbol
MID2 <- MID2[,39:49]
MID2 <- subset(MID2, rownames(MID2) %in% DEG_PPI)
MOR.SN <- read.csv("MOR.SNfilteredresult.csv")
rownames(MOR.SN) <- MOR.SN$Gene.Symbol
MOR.SN <- MOR.SN[,36:59]
MOR.SN <- subset(MOR.SN, rownames(MOR.SN) %in% DEG_PPI)
DUM <- read.csv("DUM_UniqueGene_DESeq2.csv")
rownames(DUM) <- DUM$hgnc_symbol
DUM <- DUM[,53:81]
DUM <- subset(DUM, rownames(DUM) %in% DEG_PPI)
#Find the gene names that all datasets have in common
DEG_com <- Reduce(intersect, list(rownames(DIJ), rownames(FFR),
rownames(LEW),rownames(MID1),rownames(MID2),
rownames(MID3),rownames(MID4),rownames(MOR.FC),
rownames(MOR.SN), rownames(DUM)))
#Subset each dataset with these common names so they are all the same size
DIJ <- subset(DIJ, rownames(DIJ) %in% DEG_com)
DUM <- subset(DUM, rownames(DUM) %in% DEG_com)
FFR <- subset(FFR, rownames(FFR) %in% DEG_com)
LEW <- subset(LEW, rownames(LEW) %in% DEG_com)
MID1 <- subset(MID1, rownames(MID1) %in% DEG_com)
MID2 <- subset(MID2, rownames(MID2) %in% DEG_com)
MID3 <- subset(MID3, rownames(MID3) %in% DEG_com)
MID4 <- subset(MID4, rownames(MID4) %in% DEG_com)
MOR.FC <- subset(MOR.FC, rownames(MOR.FC) %in% DEG_com)
MOR.SN <- subset(MOR.SN, rownames(MOR.SN) %in% DEG_com)
setwd("/users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/FamilialBlood")
write.csv(DIJ, "DIJ_DEGfilt.csv")
write.csv(DUM, "DUM_DEGfilt.csv")
write.csv(FFR, "FFR_DEGfilt.csv")
write.csv(LEW, "LEW_DEGfilt.csv")
write.csv(MID1, "MID1_DEGfilt.csv")
write.csv(MID2, "MID2_DEGfilt.csv")
write.csv(MID3, "MID3_DEGfilt.csv")
write.csv(MID4, "MID4_DEGfilt.csv")
write.csv(MOR.FC, "MOR.FC_DEGfilt.csv")
write.csv(MOR.SN, "MOR.SN_DEGfilt.csv")
#Run Correlation analysis on Sharc
#### filter correlations #### BOT/BOT2 not included due to less than 4 samples
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/FamilialBlood/")
DIJ <- read.csv("DIJcorresult.csv")
DIJ$Gene1 <- as.character(lapply(strsplit(as.character(DIJ$X), "\\:"), "[", 2))
DIJ$Gene2 <- as.character(lapply(strsplit(as.character(DIJ$X), "\\:"), "[", 1))
DIJ <- DIJ[,c(5,6,1,2,3,4)]
DUM <- read.csv("DUMcorresult.csv")
DUM$Gene1 <- as.character(lapply(strsplit(as.character(DUM$X), "\\:"), "[", 2))
DUM$Gene2 <- as.character(lapply(strsplit(as.character(DUM$X), "\\:"), "[", 1))
DUM <- DUM[,c(5,6,1,2,3,4)]
DUM$X <- paste(DUM$Gene1,":",DUM$Gene2, sep = "")
FFR <- read.csv("FFRcorresult.csv")
FFR$Gene1 <- as.character(lapply(strsplit(as.character(FFR$X), "\\:"), "[", 2))
FFR$Gene2 <- as.character(lapply(strsplit(as.character(FFR$X), "\\:"), "[", 1))
FFR <- FFR[,c(5,6,1,2,3,4)]
FFR$X <- paste(FFR$Gene1,":",FFR$Gene2, sep = "")
LEW <- read.csv("LEWcorresult.csv")
LEW$Gene1 <- as.character(lapply(strsplit(as.character(LEW$X), "\\:"), "[", 2))
LEW$Gene2 <- as.character(lapply(strsplit(as.character(LEW$X), "\\:"), "[", 1))
LEW <- LEW[,c(5,6,1,2,3,4)]
MID1 <- read.csv("MID1corresult.csv")
MID1$Gene1 <- as.character(lapply(strsplit(as.character(MID1$X), "\\:"), "[", 2))
MID1$Gene2 <- as.character(lapply(strsplit(as.character(MID1$X), "\\:"), "[", 1))
MID1 <- MID1[,c(5,6,1,2,3,4)]
MID1$X <- paste(MID1$Gene1,":",MID1$Gene2, sep = "")
MID2 <- read.csv("MID2corresult.csv")
MID2$Gene1 <- as.character(lapply(strsplit(as.character(MID2$X), "\\:"), "[", 2))
MID2$Gene2 <- as.character(lapply(strsplit(as.character(MID2$X), "\\:"), "[", 1))
MID2 <- MID2[,c(5,6,1,2,3,4)]
MID3 <- read.csv("MID3corresult.csv")
MID3$Gene1 <- as.character(lapply(strsplit(as.character(MID3$X), "\\:"), "[", 2))
MID3$Gene2 <- as.character(lapply(strsplit(as.character(MID3$X), "\\:"), "[", 1))
MID3 <- MID3[,c(5,6,1,2,3,4)]
MID4 <- read.csv("MID4corresult.csv")
MID4$Gene1 <- as.character(lapply(strsplit(as.character(MID4$X), "\\:"), "[", 2))
MID4$Gene2 <- as.character(lapply(strsplit(as.character(MID4$X), "\\:"), "[", 1))
MID4 <- MID4[,c(5,6,1,2,3,4)]
MID4$X <- paste(MID4$Gene1,":",MID4$Gene2, sep = "")
MOR.FC<- read.csv("MOR.FCcorresult.csv")
MOR.FC$Gene1 <- as.character(lapply(strsplit(as.character(MOR.FC$X), "\\:"), "[", 2))
MOR.FC$Gene2 <- as.character(lapply(strsplit(as.character(MOR.FC$X), "\\:"), "[", 1))
MOR.FC <- MOR.FC[,c(5,6,1,2,3,4)]
MOR.SN<- read.csv("MOR.SNcorresult.csv")
MOR.SN$Gene1 <- as.character(lapply(strsplit(as.character(MOR.SN$X), "\\:"), "[", 2))
MOR.SN$Gene2 <- as.character(lapply(strsplit(as.character(MOR.SN$X), "\\:"), "[", 1))
MOR.SN <- MOR.SN[,c(5,6,1,2,3,4)]
thresh <- 0.1
### Filter by r value
DIJ_cor.5 <- DIJ[DIJ$reg.mat > thresh | DIJ$reg.mat < -thresh,]
# DUM_cor.5 <- DUM[DUM$reg.mat > thresh | DUM$reg.mat < -thresh,]
FFR_cor.5 <- FFR[FFR$reg.mat > thresh | FFR$reg.mat < -thresh,]
LEW_cor.5 <- LEW[LEW$reg.mat > thresh | LEW$reg.mat < -thresh,]
MID1_cor.5 <- MID1[MID1$reg.mat > thresh | MID1$reg.mat < -thresh,]
MID2_cor.5 <- MID2[MID2$reg.mat > thresh | MID2$reg.mat < -thresh,]
MID3_cor.5 <- MID3[MID3$reg.mat > thresh | MID3$reg.mat < -thresh,]
MID4_cor.5 <- MID4[MID4$reg.mat > thresh | MID4$reg.mat < -thresh,]
# MOR.FC_cor.5 <- MOR.FC[MOR.FC$reg.mat > thresh | MOR.FC$reg.mat < -thresh,]
MOR.SN_cor.5 <- MOR.SN[MOR.SN$reg.mat > thresh | MOR.SN$reg.mat < -thresh,]
### Find matches
Commonedge <- Reduce(intersect, list(DIJ_cor.5$X,FFR_cor.5$X,
LEW_cor.5$X, MID1_cor.5$X, MID2_cor.5$X,
MID3_cor.5$X, MID4_cor.5$X,
MOR.SN_cor.5$X))
#Subset each dataset with these common names so they are all the same size
DIJ_CE <- subset(DIJ_cor.5, DIJ_cor.5$X %in% Commonedge)
DIJ_CE <- DIJ_CE[order(DIJ_CE$X),]
# DUM_CE <- subset(DUM_cor.5, DUM_cor.5$X %in% Commonedge)
# DUM_CE <- DUM_CE[order(DUM_CE$X),]
FFR_CE <- subset(FFR_cor.5, FFR_cor.5$X %in% Commonedge)
FFR_CE <- FFR_CE[order(FFR_CE$X),]
LEW_CE <- subset(LEW_cor.5, LEW_cor.5$X %in% Commonedge)
LEW_CE <- LEW_CE[order(LEW_CE$X),]
MID1_CE <- subset(MID1_cor.5, MID1_cor.5$X %in% Commonedge)
MID1_CE <- MID1_CE[order(MID1_CE$X),]
MID2_CE <- subset(MID2_cor.5, MID2_cor.5$X %in% Commonedge)
MID2_CE <- MID2_CE[order(MID2_CE$X),]
MID3_CE <- subset(MID3_cor.5, MID3_cor.5$X %in% Commonedge)
MID3_CE <- MID3_CE[order(MID3_CE$X),]
MID4_CE <- subset(MID4_cor.5, MID4_cor.5$X %in% Commonedge)
MID4_CE <- MID4_CE[order(MID4_CE$X),]
# MOR.FC_CE <- subset(MOR.FC_cor.5, MOR.FC_cor.5$X %in% Commonedge)
# MOR.FC_CE <- MOR.FC_CE[order(MOR.FC_CE$X),]
MOR.SN_CE <- subset(MOR.SN_cor.5, MOR.SN_cor.5$X %in% Commonedge)
MOR.SN_CE <- MOR.SN_CE[order(MOR.SN_CE$X),]
CommonGroup <- data.frame(row.names = DIJ_CE$X,
DIJ = DIJ_CE$reg.mat,
FFR = FFR_CE$reg.mat,
LEW = LEW_CE$reg.mat,
MID1 = MID1_CE$reg.mat,
MID2 = MID2_CE$reg.mat,
MID3 = MID3_CE$reg.mat,
MID4 = MID4_CE$reg.mat,
MOR.SN = MOR.SN_CE$reg.mat)
CG_conserved_up <- CommonGroup[apply(CommonGroup, MARGIN = 1, function(x) all(x > 0)), ]
CG_conserved_down <- CommonGroup[apply(CommonGroup, MARGIN = 1, function(x) all(x < 0)), ]
setwd("/Users/clairegreen/Documents/PhD/Parkinsons/Parkinsons_Code/Results/FamilialBlood/")
CG_samedir <- rbind(CG_conserved_up, CG_conserved_down)
CG_samedir$corMean <- rowMeans(CG_samedir, na.rm = FALSE, dims = 1)
CG_samedir$Gene <- rownames(CG_samedir)
CG_samedir$Gene1 <- as.character(lapply(strsplit(as.character(CG_samedir$Gene), "\\:"), "[", 2))
CG_samedir$Gene2 <- as.character(lapply(strsplit(as.character(CG_samedir$Gene), "\\:"), "[", 1))
write.csv(CG_samedir, "PD_samedir_1.csv", quote = F, row.names = F)
nodetable <- read.csv("PD_samedir_1node.csv")
nuggenes <- as.character(nodetable$name)
PDgenes <- readLines("/Users/clairegreen/Documents/PhD/Parkinsons/ParkinsonsDiseaseMalacards.txt")
celltype <- read.csv("~/Documents/PhD/TDP-43/TDP-43_Code/Results/PPI_Network/Zhang_BrainCelltype_Markers_braingenes.csv")
DEG <- readLines("ALS_sfblood_ALLgenes.txt")
nalls <- readLines("/Users/clairegreen/Documents/PhD/Parkinsons/NallsPDGWAS.txt")
targetvalPD <- c("SIRT2", "DNM1L", "STMN1", "DNAJA3", "TBK1", "RTCA", "ANXA7", "DNAJC12", "RTN4", "ADSL", "MDH1","ATP6V1G2",
"YWHAZ", "ETS1")
nodetable$PDMalacards <- nodetable$shared.name %in% PDgenes
nodetable$DEG <- nodetable$name %in% DEG
nodetable$targetvalPD <- nodetable$shared.name %in% targetvalPD
nodetable$targetvalPD <- nodetable$shared.name %in% nalls
nodetable_celltype <- merge(celltype, nodetable, by.x = "Gene.symbol", by.y = "shared.name", all = T)
nodetable_celltype <- subset(nodetable_celltype, !(nodetable_celltype$SUID == "NA"))
nodetable_celltype$targetvalPD <- nodetable_celltype$Gene.symbol %in% targetvalPD
write.csv(nodetable_celltype, "ModifiedPDNodetable.csv", row.names = F)
This analysis didn’t go as well as the TDP-43 case. The correlations were much more variable meaning I had to reduce the thresholds from 0.5 to 0.1. Additionally, I removed four datasets - the LRRK2 datasets couldn’t be used because they only had 2 and 3 samples - correlation would never calculated properly. DUM was an RNA-seq dataset so I was losing annotation coverage, and MOR.FC showed a really odd rho value distribution…
Looking back this is probably because it only has 3 control samples.
Using these 8 datasets, with a threshold of 0.1, This generated a network containing 279 nodes and 292 edges. Within this network was one clear largest connected component of 208 nodes and 249 edges.
If I increase the threshold to 0.2, the resulting network is much smaller: